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We consider identification and estimation with an outcome miss¬ 
ing not at random (MNAR). We study an identification strategy 
based on a so-called shadow variable. A shadow variable is assumed 
to be correlated with the outcome, but independent of the missing¬ 
ness mechanism conditional on the outcome. We give a necessary 
and sufficient condition for identification of the full data law given 
a valid shadow variable under MNAR, and also sufficient conditions 
which are convenient to verify in practice. The conditions are satis¬ 
fied by many commonly-used models, and thus essentially state that 
lack of identification is not an issue in many situations. Focusing on 
estimation of an outcome mean, we describe three semiparametric es¬ 
timation methods: inverse probability weighting, outcome regression 
and doubly robust estimation. We evaluate the finite sample perfor¬ 
mance of these estimators via simulations, and apply them to a China 
Home Pricing dataset extracted from the China Family Panel Survey 
(CFPS). 


1. Introduction. Methods for missing data have received much at¬ 
tention in statistics and the social sciences. Data are said to be missing 
at random (MAR) if the missingness mechanism only depends on the ob¬ 
served data, otherwise, data are said to be missing not at random [MNAR, 
Rubin (1976)]. Considering inference with an outcome subject to missing¬ 
ness, it is well known that the underlying full data law and its functionals 
are identified under MAR, and corresponding methods to make inference 
abound, to name a few, likelihood based methods (Dempster, Laird and 
Rubin, 1977), imputation (Schenker and Welsh, 1988; Rubin, 2004), inverse 
probability weighting (Horvitz and Thompson, 1952; Robins, Rotnitzky and 
Zhao, 1994), and doubly robust methods (Van der Laan and Robins, 2003; 
Bang and Robins, 2005; Tsiatis, 2007). Among them, the doubly robust 
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approach is in principle most robust, since it requires correct specification 
of either a model for the full data law, or of the missing data mechanism, 
but not necessarily both; while direct likelihood or imputation methods rely 
on correct specihcation of the full data law, and likewise inverse probabil¬ 
ity weighting relies on correct specification of the missing data mechanism. 
Since doubly robust methods effectively double one’s chances to reduce bias 
due to model misspecification, such methods have grown in popularity in 
recent years for estimation with missing data and other forms of coarsen¬ 
ing problems, such as encountered in causal inference (Van der Laan and 
Robins, 2003; Bang and Robins, 2005; Tsiatis, 2007). 

Compared to MAR, MNAR is much more challenging. Under MNAR, 
even parametric models are often non-identifiable (Miao, Ding and Geng, 
2014; Wang, Shao and Kim, 2014). However, MNAR is of common occur¬ 
rence in practice when missingness depends on the missing values even con¬ 
ditional on the observed data. Several authors have studied the problem of 
identification under MNAR. Among them, Heckman (1979) proposed the 
so-called Heckman Selection Model, which includes a pair of models for the 
outcome and the selection process conditional on correlated latent variables 
to induce an association between the outcome and selection process. Little 
(1993, 1994) introduced methods based on a pattern-mixture parametriza- 
tion for incomplete data, which specifies the distribution of the outcome for 
each missing data pattern separately. Little studied identification of pattern- 
mixture models by imposing restrictions on unidentifiable parameters across 
different missing data patterns, for example, setting the parameters of the 
joint distribution of the data under a given pattern of missingness to be equal 
to a known function of parameters under a different pattern. Fay (1986) and 
Ma, Geng and Hu (2003) employed graphical models for the missing data 
mechanism and studied identification for categorical variables. Rotnitzky, 
Robins and Scharfstein (1998) and Robins, Rotnitzky and Scharfstein (2000) 
developed sensitivity analysis methods by assuming that the association be¬ 
tween the outcome and the missingness process is fully specified. Das, Newey 
and Vella (2003) and Tchetgen Tchetgen and Wirth (2013) gave sufficient 
identification conditions under MNAR in nonparametric and semiparamet- 
ric regression models with a valid instrumental variable, which is correlated 
to the missingness process but independent of the outcome in the underlying 
population. 

Identification is sometimes also possible, if a fully observed correlate of 
the outcome, is known to be independent of the missingness conditional on 
observed covariates and the outcome of interest. Such a correlate, which we 
refer to as an shadow variable, is available in many empirical studies. For 
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example, in a study of mental health of children in Connecticut (Zahner 
et ah, 1992; Ibrahim, Lipsitz and Horton, 2001), researchers were interested 
in evaluating the prevalence of students sampled from a metropolitan center 
with abnormal psychopathological status based on their teacher’s assess¬ 
ment, which was subject to missingness. As indicated by Ibrahim, Lipsitz 
and Horton (2001), a missing teacher report may be related to the teacher’s 
assessment of the student even upon adjusting for fully observed covariates 
which included parental status of the household and physical health of the 
child. Interestingly, a separate parent report on the psychopathology of the 
child was available for all children in the study. Such a report is likely highly 
correlated with that of the teacher, however, it is unlikely to be related to 
the teacher’s response rate conditional on her assessment of the student, 
in which case, the parental assessment constitutes a valid shadow variable. 
Another example we will consider in the application section of the paper 
concerns the China Home Pricing study. In the survey, residents were asked 
the current price of their home, which was unknown to some participants. 
Those who failed to respond, did so out of a lack of awareness about the cur¬ 
rent market value of their home because of the level of specialized knowledge 
about the real estate market required in some residential areas. Nonetheless, 
homeowners typically remembered the original market value of their home 
when it was either acquired or built. The original price is expected to be 
highly correlated with the current price, however, it is unlikely to affect a 
homeowner’s knowledge of the current price conditional on the latter and 
other covariates, such as geographic location of the home and the travel time 
to the closest business center. Thus the original price of a home may be se¬ 
lected as a shadow variable for homeowner nonresponse about the current 
value of his or her home. 

Even with a shadow variable, identification is not always guaranteed with¬ 
out an additional assumption. DHaultfoeuille (2010) studied identification 
with a shadow variable of a regression model with separable error and a 
nonparametric missing data mechanism, and presented nonparametric esti¬ 
mation methods. Wang, Shao and Kim (2014) studied identification allowing 
the outcome regression to be less restricted, however with a parametric miss¬ 
ing data mechanism, and proposed inverse probability weighted estimation 
for the mean of the outcome. Zhao and Shao (2014) studied identification of 
a generalized linear model with unrestricted missing data mechanism, and 
developed estimation methods based on pseudo-likelihood. 

Several methods of estimation initially developed under MAR have re¬ 
cently been extended to handle MNAR problems under suitable conditions. 
For instance, maximum likelihood estimation (Greenlees, Reece and Zi- 
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eschang, 1982), inverse probability weighted estimation (Scharfstein, Rot- 
nitzky and Robins, 1999), outcome regression based estimation (Scharfstein 
and Irizarry, 2003; Vansteelandt, Rotnitzky and Robins, 2007). In contrast, 
doubly robust estimation methods for data missing not at random are not 
as well developed. Although Scharfstein and Irizarry (2003) and Vanstee¬ 
landt, Rotnitzky and Robins (2007) proposed doubly robust estimators for 
the outcome mean, their approach entailed a sensitivity analysis in which the 
unidentifiable association between the outcome of interest and the missing¬ 
ness mechanism is assumed to be known, and the outcome mean is estimated 
over a range of values for the assumed association. 

In this paper, we consider identification and estimation with an outcome 
missing not at random, given a valid shadow variable. The shadow variable 
approach may be viewed as counterpart to an instrumental variable, and 
enjoys different identification properties also leading to different methods 
for estimation. We present a very general identification framework with a 
shadow variable, considering in turn identification under a selection model 
parametrization in Section 2, followed by identification under a pattern- 
mixture parametrization in Section 3. Our results provide for each setting 
necessary and sufficient conditions for identification of the full data law. In 
addition, we give general sufficient conditions for identification that are in 
principle more convenient to verify in nonparametric and semiparametric 
models. While we establish identification conditions of the joint distribution 
for the underlying full data, in order to simplify the exposition, we study in¬ 
ference in the context of estimation of the outcome mean, although the meth¬ 
ods we describe can be adapted for other functionals without much difficulty. 
In Section 4, we model the joint distribution of the outcome, the shadow 
variable and the missingness process via a pattern-mixture parametriza¬ 
tion. The joint distribution is factored into three parts: a baseline outcome 
model, which models the joint distribution of the outcome and the shadow 
variable in complete-cases; a baseline missing data mechanism, which mod¬ 
els the missingness mechanism at a baseline value of the outcome; and a 
log odds ratio model, which encodes the association of the outcome and 
the missingness mechanism. Based on such models, we propose a suite of 
estimators, that rely on correct specification of some but not necessarily all 
models. Specifically, we propose an inverse probability weighted estimator, 
a regression based estimator, and we also propose a doubly robust estimator 
for the outcome mean. Provided correct specification of the log odds ratio 
model, the doubly robust estimator is consistent if either the baseline out¬ 
come model or the baseline missing data mechanism is correctly specified. 
In contrast, the inverse probability weighted estimator requires a correct 
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model for the baseline missingness mechanism, and the ontcome regression 
based estimator requires a correct model for the baseline outcome model. 
We study the finite sample performance of the estimators in Section 5 via a 
series of simulations. Simulation results confirm the double robustness prop¬ 
erty of the proposed doubly robust estimator, and illustrate the bias of the 
other estimators when a required model is incorrect. In Section 6, we apply 
the methods to the China Home Pricing example. We conclude in Section 
7, and relegate proofs to the Appendix and supplemental discussions to the 
Supplementary Material. 

2. Identification nnder a selection model parametrization. 

Throughout the paper, we let V denote the outcome, R is the missingness 
indicator with 72 = 1 if T is observed, otherwise R = 0, and X denotes fully 
observed covariates. Suppose that one has also fully observed a variable Z 
that satisfies the following conditions of a shadow variable. 

Assumption 1. (a) Z IL R\ {Y,X); (b) Z jlY \ X. 

The assumption formalizes the idea that Z only affects missingness through 
its association with Y. We use lower-case letters for realized values of cor¬ 
responding variables, for example, y for a value of the outcome variable Y. 
For simplicity, we omit X in the following, in which case (a) and (b) become 
ZYR\Y and Z^Y. However, all the identification results will hold with co¬ 
variates by replacing the required conditions with their versions conditional 
on covariates. The observed data are n identically and independently dis¬ 
tributed samples, with some values of Y missing. Thus the observed data 
are (Z, RY, R). We use the symbol P to denote a joint density function, for 
instance, P{z, y, r) for the joint density of (Z, Y,R). A person’s contribution 
to the observed data likelihood is 

P{z,y,r = lYP{z,r = 0)^"'’, 

which represents all of the information contained in the observed data. The 
observed data likelihood is a functional of the joint distribution of (Z, Y, R), 
however, given the observed data likelihood, the joint distribution may not 
be uniquely determined. Let Vz,y,r denote a given model for the joint distri¬ 
bution of {Z,Y,R), which is a collection of all candidates of P{z,y,r). The 
joint distribution is said to be identifiable if and only if given the observed 
data likelihood, the element of Vz,y,r is uniquely determined. 

In this section, we study identification of selection models with a shadow 
variable. The selection model specifies the missing data mechanism and the 
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distribution of the outcome separately. Under Assumption 1, we factorize 
the joint density as 

P{z,y,r) = P(r\y)P{z,y). 

We first consider a binary data example, and demonstrate identihcation of 
the joint distribution. Then we extend to general cases and give a necessary 
and sufficient condition for identification. We also present a sufficient con¬ 
dition which is more straightforward to check in practice. We illustrate the 
conditions with a number of prominent examples. 

2.1. The binary case. Suppose Z and Y are binary. One is only able to 
identify the quantities P{z,y\r = 1), P{z\r = 0) and P{r = 1) from the 
observed data. These quantities are functions of the unknown parameters 
P{z,y) and P{r = l\y). Without imposing any further model assumptions, 
we have hve unknown parameters, and five independent estimating equa¬ 
tions. Intuitively, we can solve for the parameters from the available equa¬ 
tions to identify them, provided the solution is unique. The approach is 
formalized in the following theorem for binary data. 

Theorem 1. Suppose Assumption 1 holds, then the joint distribution of 
{Z,Y,R) is uniquely identified from the observed data {Z,RY,R). 

Theorem 1 was first established by Ma, Geng and Hu (2003), and the proof 
can be found there. Assumption 1 is indispensable for identification. The 
first part of the assumption says that Z is shadow in the sense that Z is not 
included in the missing data mechanism. It renders the number of unknown 
parameters equal to the number of available estimating equations, which is 
necessary for identification. The second part, in fact, imposes a full rank 
condition on the system of equations, and further guarantees uniqueness of 
its solution. 

2.2. General cases. For general outcomes and shadow variables, an ex¬ 
plicit rank condition is no longer available. However, as before, identification 
remains possible only for a subset of all data generating mechanisms which 
we characterize next. In this vein, we let Vriy, Pyiz and Pz denote the sets 
of laws that satisfy Assumption 1 for P{r = l\y), Piy\z) and P{z), respec¬ 
tively. Because Z is fully observed, P{z) is identified even when Pz includes 
all distributions for Z. However, we must further reduce membership for 
Pr\y or/and Py\z to guarantee identification. Identification of the latter 
models is determined by the relationship between their respective members. 
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Theorem 2. Under Assumption 1, the joint density P{z,y,r) is iden¬ 
tified from the observed data, if and only if Vr\y o.nd 'Py\z satisfy the 
condition: for any two pairs of elements Pfir = l\y),Pi{y\z) and P 2 {r = 
l\y), P 2 {y\z), their ratios are not equal, that is, with a positive probability 

Pi{y\z) , P 2 {r = l\y) 

P 2 {y\z) ^ Pi(r = l|y)' 

Theorem 2 presents a necessary and sufficient condition for identification 
of the joint distribution, and thus a sufficient condition for identification of 
its functionals, such as the outcome mean. It is quite convenient to check 
the conditions in Theorem 2 for parametric models. We further illustrate it 
with a binary example. 

Example 1. The binary case. We consider a saturated selection model 
with Vy\z = {P{y = l^) = 'nz,z = 0,1}. Then for any two candidates 
Piiy = l\z) = piz and P 2 {y = l\z) = r] 2 z, we have 

Pi{y\z) ^ rfifii - pizY-y 
P2{y\z) - V2zy-y' 

Note the assumption ZfiLY, implying pn ijiQ for i = 1,2. We establish in 
the Supplementary Material that the above ratio varies with z for different 
candidates of P{y\z). However, the ratio of any two missing data mecha¬ 
nisms is a function of y. Thus, the condition of Theorem 2 is satisfied, and 
therefore the joint distribution of {Z, Y, R) is identified. 

The conclusion reached in Example 1 is also consistent with Theorem 
1. The binary case has a very clear candidate set of models, however, for 
semiparametric and nonparametric models, explicit expressions for candi¬ 
date sets are generally not available, so that, the condition of Theorem 2 is 
not convenient to check. The following corollary provides a more practical 
approach to check for identification in such models. It follows from the fact 
that Pi{r = l\y)/P 2 {r = l|y) is a function of y, which is usually not equal 
to Pi{y\z)/P 2 {y\z), a function that varies with z. 

Corollary 1. For any two candidates Pi{y\z) and P 2 {y\z) ofVY\z, 
if the ratio Pi{y\z)/P 2 {y\z) is a constant or varies with z, then the joint 
distribution of {Z,Y, R) is identifiable under Assumption 1. 

Note that we do not specify the missing data mechanism in Corollary 1. 
Theorem 2 and Corollary I are not trivial, since we do have counterexamples 
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that fail to satisfy the conditions, such as the saturated model of trinary V 
with binary Z. However, the corollary still allows for many commonly-used 
models. We consider models for the case of a continuous shadow variable, 
for instance, the linear regression model: 

y = /3o + /3i^ + e, e\LZ. 


The model does not specify the distribution of the shadow variable as well of 
the error term, so the joint distribution is semiparametric with two param¬ 
eters (/3o,/3i) and infinite dimensional parameters Pg, Pz and P{r = l\y). 
Besides Assumption 1, the missing data mechanism is otherwise nonpara- 
metric. Nonetheless, the joint distribution of (Z, Y, R) is identifiable. One 
can likewise show that identification remains possible in the even larger 
nonparametric regression model with independent additive error. 

(1) Y = y{Z) + e, e\lZ, 

with IX unrestricted. Under regularity conditions, we can prove that Vy\z sat¬ 
isfies the condition of Corollary 1, and thus the joint distribution of [Z, Y, R) 
is identifiable. The identification of such models can be further strengthened 
by considering the heteroscedastic model: 


(2) 


Y = y{Z) + a{Z)e, elLZ, 


or equivalently 




y - 

a{z) 


with fx, a unrestricted, and unknown probability density function P^. This 
class of models is quite common in statistical analysis, such as linear and 
nonlinear regression. The model is identifiable under MNAR with the help 
of a shadow variable. 


Theorem 3. Suppose the heteroscedastic model (2) with a continuous 
shadow variable, and the density function of the error term P^ satisfies the 
regularity conditions in the Appendix and condition 

(a) for some linear and one-to-one mapping M : P{{e—a)/b} i —> G{t,a,b), 
and any b, b' > 0, (a, h) (o', 6'), 

lim G(t, a, b) / G{t, a', b') = 0 or oo for some to. 
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Then Vy\z satisfies the condition of Corollary 1, and thus the joint distri¬ 
bution of {Z, Y, R) is identified from the observed data under Assumption 1 
for an otherwise unrestricted model for the missingness process. 

Theorem 3 shows identification of a very large class of models. Identi¬ 
fication results of model (1) were previously established by DHaultfoeuille 
(2010), and model (2) allowing heteroscedastic error term can be viewed as 
a generalization of their results. It is particularly interesting because it es¬ 
sentially states that lack of identification is not an issue in many situations. 
Condition (a) of Theorem 3 imposed on the distribution of the error term 
states that the impact of the error term and that of the shadow variable 
on the outcome is distinguishable to some extent. It is satisfied by many 
commonly-used models, for instance, the normal error density with M be¬ 
ing the identity mapping and G being the density function itself, and the 
Student-t error density with M being the inverse Fourier transform and 
G being its characteristic function. For illustration, consider the following 
simple normal outcome model: let 'Py\z = {-^(vo + ^ 70)7i)<7^}, 

Vz = : /djCrl}. According to Theorem 3, the joint distribution 

of (Z, y, R) is identifiable under such outcome model with an unrestricted 
missing data mechanism. Note however, that one cannot understate the cen¬ 
tral role of the shadow variable in all examples given above. Without such 
a variable, identification is no longer guaranteed in these examples without 
imposing an alternative assumption. For instance, let us revisit the nor¬ 
mal outcome model in which the shadow variable is no longer available, 
i.e. Assumption 1 does not hold. Let Vy = : 7, Without 

further restrictions on the missing data mechanism, the joint distribution 
of (y, R) is not identifiable. Moreover, even if one were to assume a para¬ 
metric model for the missing data mechanism, identification is not guaran¬ 
teed. For example, as one assumes the Logistic missing data mechanism, 
V^Y = {logit P{r = l|y) = oq + otiy : ao,ai}, counterexamples to identi¬ 
fication are easy to find, see Miao, Ding and Geng (2014) and Wang, Shao 
and Kim (2014). 

3. Identification under a pattern-mixture parametrization. As 

an alternative to the selection model parametrization, the pattern-mixture 
parametrization factorizes the joint density of (Z, Y, R) as 

Piz,y,r) = P{z,y\r)P{r). 

As pointed by Little (1993, 1994), the parameters of P{z,y\r = 0) are not 
identifiable in general. However, he showed that identification can be re¬ 
covered by imposing sufficient restrictions on the parameters of the density 
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of V across different missing data patterns. Instead of imposing functional 
restrictions, here we introduce the log odds ratio function relating (Z, Y) 
and R. In a slight abuse of notation, we denote the log odds ratio func¬ 
tion by OR{y,z), which is dehned as a measure of the deviation between 
P{z,y\r = 1) and P{z,y\r = 0): 


(3) 


OR{z,y) = log 


P{z, y\r = 0)P{z 
P{z,y\r = l)P{z 


o,y = ok = 1) 

0,y = 0|r = 0) ’ 


Here (z = 0, y = 0) is a user-specified baseline value of {Z, Y), which may be 
set to any value in the support of (Z,Y). We note the following equivalent 
representation under Assumption 1 


OR{z,y) 


P(r = 0|y)P(r = l\y = 0) 
P{r = 0|y = 0)P{r = l|y) ' 


Under Assumption 1, the log odds ratio is only a function of y, which we 
therefore denoted OR{y). As shown above, the log odds ratio function also 
encodes the degree to which the missing data process departs from MAR, 
which corresponds to OR{y) = 0. The joint distribution of {Z, Y, R) is 
uniquely determined by OR{y), P{z,y\r), and P{r = l\y = 0). 


(4) P{z,y,r) = Ciex.p{{l-r)OR{y)}P{r\y = 0)P{z,y\r = 1) 
= C 2 exp{-rOP(y)}P(r|y = Q)P{z, y\r = 0), 


where C'i,C '2 are normalizing constants, and E[exp{OP(?/)}|r = 1] < 00 . 
The symbol E stands for expectation. One can verify that Ci = P{r = 
1)/P(r = l\y = 0), C 2 = P{r = 0)/P{r = 0\y = 0). 

Identification under a pattern-mixture parametrization depends on mod¬ 
els for the conditional density of {Z, Y) given R = 0 and the log odds ratio. 
We let Pyizfl and OTZy denote models for P{y\z,r = 0) and OR{y) re¬ 
spectively, that satisfy Assumption 1. We reduce the membership of Py\z,o 
or/and OTZy to guarantee identification. 


Theorem 4. Under Assumption 1, the joint density P{z,y,r) is identi¬ 
fiable, if and only ifPy\z,o ^.nd OTZy satisfy the following condition: for any 
two pairs of elements Pi{y\z,r = 0),ORi{y) and P 2 {y\z,r = 0),OR2{y), 
their ratios are not equal up to the constant Ci^ 2 , that is, with a positive 
probability 

Pi{y\z,r = 0) exp{OPi(y)} 

P2{y\z,r = Q)^ ^’^exp{OP2(y)}’ 
where C' 1^2 = E[exp{—OPi(y)}|r = 0]/E[exp{—OP 2 (y)}k = 0]. 
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The constant Ci ^2 is related to the log odds ratio functions ORi{y) and 
OR 2 {y) ■ An immediate sufficient condition is that the ratios presented in the 
theorem never equal up to any constant, and a corollary similar to Corollary 
1 follows. 


Corollary 2. If the ratio Pi{y\z,r = 0)/P 2 {y\z,r = 0) is a constant 
or varies with z for any two elements ofPY\zfl, then the joint distribution 
of {Z, Y, R) is identified under Assumption 1. 

Consider again the binary data case. Then we have the following result. 


Example 2. We consider a saturated pattern-mixture model Vy\z,o = 
{P{y = l\z,r = 0) = 9z,z = 0,1}. Then for any two candidates Pi{y = 
1|2:, r = 0 ) = 9iz and P 2 {y = l\z, r = 0 ) = 02z; we have 

Pi{y\z,r = 0) _ 9Ul-0izy-^ 

P 2 iy\z,r = 0) 9l^{l-92zY-y' 

Note the assumption ZjLY, we prove in the Supplementary Material that 
9il Y for i = 1,2, and that the above ratio must vary with z for different 
candidates of P{y\z,r = 0). Thus, the condition of Theorem f is satisfied, 
and therefore the joint distribution of {Z, Y, R) is identifiable. 


We can also consider the general heteroscedastic regression model with a 
continuous shadow variable: 


(5) 


P{y\z,r) 



P 


£,r 


y - yrjz) \ 
CTriz) J ’ 


r = 0,l. 


with fj,r and <7^ unrestricted, and P^^r, r = 0,1 are unrestricted density func¬ 
tions. Such models provide a sufficient condition for identification of the 
joint distribution of {Z,Y,R). 


Theorem 5. Suppose either P{y\z, r = 1) or P{y\z, r = 0) follows model 
(5) with the corresponding density function P^^r satisfying condition (a) of 
Theorem 3 as well as certain regularity conditions given in the Appendix. 
Then Vy\z,o satisfies the condition of Corollary 2, and thus the joint distri¬ 
bution of {Z, Y, R) is identified. 


Little (1993, 1994) studied identification of pattern-mixture models un¬ 
der various restrictions on the parameters of different patterns. For instance, 
the missing completely at random [MCAR, Little (1993); Little and Rubin 
(2002)] mechanism requires identical patterns: P{z,y\r = 1) = P(z,y\r = 0). 
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The parameter restriction approach may be particularly useful for paramet¬ 
ric models. But it requires precise knowledge of the data generating process, 
and may be more challenging to impose such restrictions in semiparametric 
and nonparametric models with infinitely dimensional parameters. However, 
the results we give above provide an alternative strategy to deal with such 
models by characterizing the largest class of models that are identifiable, 
which subsumes the approach of Little (1993). The proposed approach also 
enjoys the benefit that it is explicit about the key role played by the shadow 
outcome for identification, without which Theorem 5 fails to hold in general. 


4. Estimation. In this section, we mainly focus on inference about 
the outcome mean, denoted by /i = E(y), based on the pattern-mixture 
parametrization. We describe several strategies for estimation, each depend¬ 
ing on different modeling assumptions. We take covariates into consideration, 
and model the joint distribution of (Z, Y, R) conditional on X in three parts: 
the log odds ratio model 


( 6 ) 


OR{y\x) = log 


P{z, y\r = 0, x)P{z 
P{z, y\r = 1, x)P{z 


o,y = Ok = 

0,y = 0|r = 0,x)’ 


the baseline outcome model P{z,y\r = l,x), and the baseline missing data 
mechanism P{r = l\y = 0,x). The log odds ratio function does not depend 
on z by a similar argument to the case without covariates. We have the 
following parametrization of the joint density of (Z, T, R) conditional on X, 


(7) P{z, y, r\x) = c{x) exp{(l — r)OR{y\x)}P{r\y = 0, x)P{z, y\r = 1, x). 


with E[exp{Oi?(y|x)}|r = l,a:] < oo, and c{x) being the normalizing con¬ 
stant given X making the right hand side a well defined density function. We 
can express the missing data mechanism and the outcome model for the two 
patterns as functionals of P{z,y\r = l,x), P{r = l\y = 0,x) and OR{y\x). 


(8) P{r = l\y,x) = 

(9) P{z,y\r = 0,x) = 

(10) E(y|r = 0, x) = 


P{r=l\y=0,x) 

P{r=l\y=0,x)+ey.p{OR{y\x)}{P(r=0\y=0,x)} ’ 
eyip{OR{y\x)}P{z,y\r=l,x) 
E[exp{Oi?(y |x)}|r=l,a;] ’ 

E[exp{OR(y|a;)}Y|r=l,a;] 
E[exp{OK(Y|a:)}|r=l,a;] ' 


These equalities formulate the main parametrization we shall use throughout 
for estimation. They are straightforward to verify. We specify parametric 
models for the three parts separately: the baseline missing data mechanism 
P{r = l\y = 0,x;q!), the baseline outcome model P(z,y|x,r = l;/3), and 
the log odds ratio model OR{y\x-,^). 
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Suppose that the log odds ratio model is correctly specified. First, we de¬ 
scribe an inverse probability weighted estimator and an outcome regression 
based estimator, which require correct specification of the baseline missing 
data mechanism or the baseline outcome model, respectively. We then de¬ 
velop a doubly robust estimator, which is consistent if either the baseline 
missing data mechanism or the baseline outcome model is correct, but not 
necessarily both. 


4.1. Inverse probability weighted estimator. The inverse probability 
weighted estimator relies on the propensity score model P{r = 1|?/, x; a, 7 ), 
which is determined by P{r = l\y = 0,x;q:) and OR{y\x]'j) as shown 
in equation ( 8 ). We define the inverse of the propensity score model as 
w{y,x;a,'y) = 1/P{r = l|y,x;a, 7 ), and solve the equation for a and 7 ^^^ 

(11) P [{rc (X, Y■ S, R-l}h {X, Z)] = 0, 

with a user-specihed vector function h, for example, h{X,Z) = {X"^, Z)'^. 
The symbol P stands for sample mean. One may decide to allow the di¬ 
mension of h to exceed that of (a, 7 ), in which case one may adopt the 
Generalized Method of Moments (Hall, 2005). The proposed inverse proba¬ 
bility weighted estimator for the mean of the outcome is 

t^ipw — H* {u) {X, Y, o, ’yipw) RY^ . 

This estimator was previously described by Wang, Shao and Kim (2014). 


4.2. Outcome regression based estimator. As an alternative to inverse 
probability weighting, we can estimate the mean of the outcome by combin¬ 
ing the baseline outcome model and the log odds ratio model. We first ob¬ 
tain a consistent estimate for the baseline outcome model P{z, y\r = 1, x; /?), 
such as the maximum likelihood estimator, and then propose to solve the 
following estimating equation for ^reg 


(12) P [(1 - i?) {/i (Z, X) - E [h (Z, X) |r = 0, X; %eg] } 


= 0 , 


with previously obtained /3 and a user-specified vector function h. The con¬ 
ditional expectation is evaluated under the conditional density P{z\r = 
0, X, f3,^reg), which is determined by P{z, y\r = 1, x; /?) and OR{y\x] jreg) as 
in equation (9). The Generalized Method of Moments may also be used here 
whenever the dimension of h exceeds that of 7 . We then obtain an estimator 
E(y|r = 0,x; I3,^reg) from equation (10) for the outcome conditional mean 
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among the subset of the population with missing outcome, i.e. K(Y\r = 0, x). 
The regression based estimator is finally given by 

Jlreg = P {(1 - i?)E (y|r = 0, X; ^reg) + ^ (Y\r = 1, X; . 

Note that an alternative outcome regression-based estimator of ^ can be 
obtained by substituting T for E (y|r = 1, X; /3). 


4.3. Doubly robust estimator. The proposed doubly robust estimator re¬ 
quires a doubly robust estimator 7 ^^ for the log odds ratio parameter 7 , 
which solves the estimating equation 
(13) 


{w {X, Y; a, %r) R-l][h{Z,X)-¥.[h {Z, X)\r = 0,X- A %r] } 


= 0 , 


with previously obtained a, j3 and a user-specified vector function h. We then 
obtain an estimator E(y|r = 0,x', /3,^dr) from equation (10) for E(y|r = 
0,x). Finally, the proposed doubly robust estimator for the mean of the 
outcome is 




y-E 


w {X,Y-,a,%r) r\^ 

[Y\r = 0,X-,P,jdr} 


Y\r = 0,X; 


+ E 



The theorem below summarizes the consistency of the estimators proposed 
above. 


Theorem 6. Suppose the joint distribution of {Z, Y, R) conditional on 
X satisfies the identification condition in Theorem f; the estimating equa¬ 
tions (11), (12), (13) have a unique solution; and the log odds ratio model 
OR{y\x-,'y) is correctly specified. Then we have 

(a) if P{r = l\y = 0, x; a) is correctly specified, then the inverse probability 
weighted estimator fiipw is consistent; 

(b) if P{z,y\r = l,x;/5) is correctly specified, and (3 is consistent, then the 
outcome regression based estimator flreg is consistent; 

(c) if at least one of conditions (a) and (b) holds, then the doubly robust 
estimators fidr o,nd fidr are consistent. 

The estimators are also asymptotically normal under standard conditions 
(Hall, 2005). In the Supplemental Material, we derive nonparametric esti¬ 
mators of the asymptotic variance of each of these estimators, which are 
consistent even under model mis-specification. 
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The log odds ratio function OR{y\x) plays a central role for estimation 
under all three proposed strategies, as they all rely on a correct log odds 
ratio model. This is not entirely surprising, since as previously mentioned, 
the log odds ratio encodes the degree to which the outcome and the missing 
data process are correlated. Therefore, in order to estimate a population 
functional of T, one must first be able to account for its association with 
the missing data process. Note also that previous doubly robust estimators 
for missing data have assumed that this log odds ratio is known exactly, 
either to be equal to the null value of 0 under MAR (Bang and Robins, 
2005; Tsiatis, 2007; Van der Laan and Robins, 2003), or to be of a known 
functional form with no unknown parameter as in Vansteelandt, Rotnitzky 
and Robins (2007). Without a shadow variable, the log odds ratio function 
cannot generally be identified in the latter setting if outcome data are miss¬ 
ing not at random. Therefore, we have in fact developed a general strategy 
to relax these previous stringent assumptions. The doubly robust estimator 
provides us with a second chance to correct the bias due to possible mis- 
specification of either the baseline outcome model or the baseline missing 
data mechanism. We have shown that given a correct model for the log odds 
ratio function, one can be doubly robust both in estimating the association 
between the outcome and the missingness process, and the outcome mean. 
However, if both baseline models are incorrect, the doubly robust estimator 
will generally also be biased. 

5. Simulations. In this section, we study the finite sample perfor¬ 
mance of the proposed methods via simulations. We consider continuous 
and binary outcomes in turn, however, to save space we relegate simulation 
results for a binary outcome to the Supplemental Material. 

First, we generate a covariate X ~ iV(0,1), and then generate data from 
the conditional density (7) with a linear log odds ratio model: OR{y\x) = 
— 7 y, bivariate normal baseline outcome model and Logistic baseline missing 
data mechanism. We consider two choices for the baseline outcome model: 
Y\r = 1, X ~ N{/3io + /3ii(0.5x -|- 0.2x^), 1} with Z\y, x ~ N{/32o + /? 2 i( 2 x -|- 
+ I322y, 1}; and Y\r = 1, x ~ iV(/3io -L /Sux, 1) with Z\y, x ~ 1 V (^20 + 
/d 2 iT^ -|- I322y, 1); and we consider two choices for the baseline missing data 
mechanism: logit P{r = l\y = 0,x) = uo + ai{x — 0.5x^), and logit P{r = 
l|y = 0, x) = ao -|- cqx. The models are identifiable according to Theorem 5. 
Parameter values are set equal to (uoi cq) = (0-8) 0.5), (/3io, /3ii) = (0.5, 0.5), 
(/ 320 ) / 321 ) P 22 ) = (—0.5, 0.5,1), and 7 = —0.5. For these settings, the missing 
data proportions are between 40% and 45%. We generate data from the four 
combinations of the baseline outcome model and missing data mechanism, 
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but always employ a simpler model for estimation: Y\r = l,x ~ N{j3io + 
I3iix, 1) with Z\y, x ~ N{j32o+I32ix^+h 2 y■, 1), and logit P{r = l\y = 0, x) = 
ao + aix. We also consider a naive estimator assuming MAR obtained via 
linear regression based on complete data: K{Y\x,z) = E(y|r = l,x,z) = 
Poo + Poix + Po 2 Z- We simulate 1000 replicates under 500 and 1500 sample 
sizes for each combination and summarize the results with boxplots. 

Figure 1 presents the results for the outcome mean, Figure 2 presents the 
results for the parameter of the log odds ratio model. Table 1 shows coverage 
probability of the 0.95 confidence interval estimated with the method in the 
Supplementary Material. In (i) of Figure 1, the baseline missing data mech¬ 
anism is incorrect but the baseline outcome model is correct. As a result, 
the outcome regression based estimator works well and has an appropri¬ 
ate coverage probability, but the inverse probability weighted estimator has 
very large bias and coverage probability well below the nominal level. It 
is not consistent as the sample size increases. In (ii), the baseline missing 
data mechanism is correct but the baseline outcome model is incorrect. The 
inverse probability weighted estimator has small bias and has an approxi¬ 
mate 0.95 coverage probability, but the outcome regression based estimator 
is biased. However, in (i) and (ii), the doubly robust estimator performs the 
best with smaller bias and approximate 0.95 coverage probability, and it is 
consistent as sample size increases. In (iii), both models are correct, and all 
proposed estimators have similar performance and are all consistent. In (iv), 
neither of the two models is correct, but the doubly robust estimator has 
smaller bias than others. We also observe that as expected, the naive esti¬ 
mator assuming MAR is significantly biased in all cases. The performance 
of the estimators for the log odds ratio model is similar to the estimators for 
the mean. The results show robustness of the doubly robust estimator. As a 
conclusion, we recommend the doubly robust approach for inference about 
the mean parameter as well as to evaluate the magnitude of selection bias. 

6. China Home Pricing example. The dataset is extracted from 
China Family Panel Studies (CFPS). The CFPS project is conducted by In¬ 
stitute of Social Science Survey of Peking University in Beijing since 2008. 
The survey data contain information on various aspects of the household, 
such as socioeconomic information and education information of the house¬ 
holds. Details of the survey can be found at http://www.isss.edu.cn/ 
cfps/EN/. The dataset we used consists of 3126 households from Liaoning, 
Shanghai, Henan, Guangdong and Yunnan province of China. The provinces 
are located to the east of the Heihe-Tengchong line, a line drawn from the 
town of Heihe in the northeast province of Heilongjiang to Tengchong in the 
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Fig 1: Boxplots of the estimators for the mean of a normal ontcome. 
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Fig 2: Boxplots of the estimators for log odds ratio model of the normal 
outcome example. 

Note for Fig 1 and 2: data are analyzed with four methods: doubly robust estimation 
(DR), inverse probability weighting (IPW), regression based estimation (REG), and 
linear regression based on complete data (CMP). In each boxplot, white boxes are for 
sample size 500, and gray ones for 1500. The horizontal line marks the true value of the 
parameter. 
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Table 1 

Coverage probability of 0.95 confidence interval for a normal outcome. 







7 



DR 

IPW 

REG 

DR 

IPW 

REG 

FT 

0.953 

0.748 

0.942 

0.960 

0.771 

0.956 


0.951 

0.359 

0.944 

0.940 

0.374 

0.960 

TF 

0.944 

0.944 

0.302 

0.957 

0.952 

0.188 


0.959 

0.943 

0.013 

0.955 

0.946 

0.000 

TT 

0.949 

0.944 

0.940 

0.952 

0.954 

0.954 


0.947 

0.943 

0.934 

0.948 

0.952 

0.955 

FF 

0.762 

0.760 

0.235 

0.837 

0.721 

0.287 


0.441 

0.410 

0.005 

0.661 

0.247 

0.007 


Note: Coverage probability of the proposed estimators under four simulation situations: 
FT stands for incorrectly specified baseline missing data mechanism and correctly 
specified baseline outcome model, and the other three situations are similarly defined. 
The variance and confidence interval estimation method can be found in the 
Supplementary Material. The result of each situation includes two rows, of which the 
first row stands for sample size 500, and the second for 1500. 


southwest province of Yunnan. The Heihe-Tengchong line divides the area 
of China in half. But only 6% of the population lives in the west; 94% of 
the population lives in the eastern half of the country [page 19, Naughton 
(2007)]. Shanghai and Guangdong are located in the southeast coast. They 
are two most developed provinces of China, but are suffering from conges¬ 
tion, pollution and extortionate home prices due to dense population. The 
other three provinces, Liaoning located in the northeastern, Henan located 
in the middle, Yunnan located in the southwest of this area, are three less de¬ 
veloped provinces, and the Chinese government is committed to accelerating 
development of this area and improving quality life of its residents. Against 
the background of booming real estate, home price is a central issue for both 
the residents and the government. In the survey, the variable of interest is 
log of current home price (Ihmpr, in 10^ RMB yuan). Homeowners were 
asked the current price of their house, to which 587(18.8%) householders 
responded “don’t know”, and 9(3%) refused to respond. Therefore they are 
treated as missing values in our analysis. The dataset contains completely 
observed covariates: prov (1, developed province, 0 less developed province), 
urban (1, urban, 0 rural household), Itm (log of travel time to the nearest 
business center, in minutes), Isiz (log of house building area, in square me¬ 
ters), Ifmsz (log of family size, i.e. number of families living together), fir 
(which floor the family live in), If mine (log of last year family income, in 
RMB yuan), recons (ever experienced demolishment and reconstruction), 
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and loripr (log of original home price, in 10^ RMB yuan). 

In developed areas and big cities, besides building costs, home prices are 
also affected by many factors: location issues such as bus stops and busi¬ 
ness center, property management such as maintenance and garden care, 
the strength of the competition, and renovation potential. As a result, home 
prices are relatively high and valuation is complicated due to complexities of 
real estate market. Home price valuation usually relies on dedicated agen¬ 
cies who have specialized knowledge and rich experience with real estate 
transactions. A householder is seldom well aware of the valuation of their 
home. However, in less developed areas, real estate market is pristine and 
slow-growing. There are much fewer factors that affect home prices. As a 
result, home prices tend to be lower and valuation depends mostly on build¬ 
ing costs. Valuation does not need much specialized knowledge, and in fact, 
transactions and valuation in these areas seldom rely on appraisal compa¬ 
nies. Therefore, nonresponse of Ihmpr is likely related to the current value 
of the home, and the missingness is likely not at random. The original price 
of the house is related to the current price, however, we expect that the orig¬ 
inal price is independent of nonresponse conditional on the current value of 
the house and fully observed covariates. In other words, we assume that two 
homeowners with houses of equal current value and common covariates do 
not differ in their probability of nonresponse even if the original purchase 
price of their respective homes differs. Therefore, we use loripr as a shadow 
variable. 

Let X denote all other covariates including 1 for the intercept, we assume 
a linear log odds ratio model: Oi?(lhmpr|x) = —ylhmpr, a Logistic baseline 
missing data mechanism: 

logit P{r = l|x, Ihmpr = 0) = xa, 
and a bivariate baseline outcome model: 

E(lhmpr|r = l,x) = x/3i, 

E(loripr|r = l,x, Ihmpr) = x/32i -|- /322lhinpr. 

We summarize the results of mean of Ihmpr and the log odds ratio in 
Table 2, and Table A.2 summarizing model fits for all baseline models is 
relegated to the Supplementary Material. We first apply standard estima¬ 
tion methods assuming missing at random (R_lllhmpr|V, loripr) to the 
data. From Table 2, the standard regression estimate for mean and 95% 
confidence interval (in bracket) of Ihmpr is 2.693(2.637,2.749) using a lin¬ 
ear regression, and the standard inverse probability weighted estimate is 
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2.694(2.638,2.751) using a Logistic propensity score model. Details of stan¬ 
dard estimates can be found in the Supplementary Material. However, the 
estimates using the proposed methods are somewhat different from those 
using standard methods. The estimate is 2.595(2.530,2.659) using the dou¬ 
bly robust approach, 2.586(2.518,2.655) using regression based estimation, 
and 2.599(2.533,2.665) using inverse probability weighting; the correspond¬ 
ing estimates for the log odds ratio parameter 7 are 0.497(0.329,0.664), 
0.745(0.430,1.060) and 0.413(0.241, 0.586). In the above, the three proposed 
methods produce similar point and interval estimates for the outcome mean. 
None of the three confidence intervals includes point estimates from standard 
methods, and none of the three point estimates falls in the confidence inter¬ 
vals from standard methods. Doubly robust estimation and inverse proba¬ 
bility weighted estimation also produce similar point and interval estimates 
of the log odds ratio parameter 7 . Although regression based estimation pro¬ 
duces a slightly higher result, all three confidence intervals of 7 exclude 0 , 
providing significant empirical evidence of selection bias due to the outcome 
missing not at random. There is also substantial empirical evidence that 
homeowners with houses valued at lower prices are less likely to respond to 
the survey. 


Table 2 

China Home Pricing example. 




Mean (/i) 


Log odds ratio ( 7 ) 

DR 

2.595 

(2.530, 2.659) 

0.497 

(0.329, 0.664) 

REG 

2.586 

(2.518, 2.655) 

0.745 

(0.430, 1.060) 

IPW 

2.599 

(2.533, 2.665) 

0.413 

(0.241, 0.586) 

CMP 

2.693 

(2.637, 2.749) 



marlPW 

2.694 

(2.638, 2.751) 




Note: Point estimates and 95% confidence intervals of the outcome mean and log odds ratio 
parameter. CMP and marlPW stand for standard regression estimation and standard inverse 
probability weighted estimation. 


7. Discussion. The instrumental variable approach is a popular and 
fairly well developed method for identification of data missing not at random 
(Das, Newey and Vella, 2003; Tchetgen Tchetgen and Wirth, 2013). In con¬ 
trast, although recently considered by DHaultfoeuille (2010); Wang, Shao 
and Kim (2014); Zhao and Shao (2014) in a series of seminal papers, the use 
of a shadow variable as an alternative identification strategy in the context 
of data missing not at random is currently not well established in the missing 
data literature. In this paper, we further explore the use of such a variable, 
and we establish a general framework for identification, together with a suite 
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of semiparametric estimators, including a doubly robust approach. The basic 
idea of a shadow variable approach is to reduce the unidentifiable parame¬ 
ters or candidate models. It is fairly straightforward to illustrate for simple 
situations but becomes more involved and less convenient for larger, more 
complicated models such as semiparametric or nonparametric models. We 
presented necessary and sufficient conditions for identification under a se¬ 
lection model parametrization, and a pattern-mixture parametrization. One 
may choose to verify the conditions according to which parametrization may 
be most convenient or appropriate in a given application. As we observed 
that the sufficient conditions we give are satisfied by many commonly-used 
models, lack of identification is not an issue in many situations with a shadow 
variable. We also developed semiparametric estimation methods that extend 
analogous methods available when data are missing at random. 

A few remaining open questions are of significance but not within the 
scope of the paper. First, the choice of the function h indexing various esti¬ 
mating equations is directly related to the efficiency of the resulting estima¬ 
tors. Modern semiparametric efficiency theory may be used to identify an 
optimal choice for such index functions, and therefore to construct a semi¬ 
parametric locally efficient estimator (Bickel et ah, 1993; Van der Vaart, 
2000). Second, one may note that in the construction of our doubly ro¬ 
bust estimator, the shadow variable was not included as predictor for the 
outcome model. However, the shadow variable may in fact be treated as 
any ordinary predictor for the purpose of modeling the outcome, and the 
efficiency-robustness trade-off of incorporating such a variable in the out¬ 
come model remains to be investigated. 

The proposed identification and doubly robust estimation methods have 
potential application in other related topics. The methods can be applied 
to longitudinal data analysis, which is often subject to dropout or missing 
data. Their potential use for such more general settings is the topic of future 
study. 


APPENDIX A: PROOFS 

Proof of Theorem 2 

Proof. The proof is based on contradiction. First, we note that all the 
quantities we can identify from the observed data are P{z, y,r = 1) and 
P{z). Suppose we have two candidate models satisfying the same observed 
quantities: 

Pi(^,y, r = 1) = P 2 {z,y,r = 1), Pi{z) = P 2 {z) almost surely. 
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By the shadow variable assumption, we have the factorization for the joint 
density: 

Pi{z, y, r) = Pi{r\y)Pi{y\z)Pi{z), i = 1,2. 

So from the above identities we have 

Pi{r = l\y)Pi{y\z) = P 2 {r = l\y)P 2 {y\z), 


and equivalently 


-Pi(r = l\y) 
P2{r = l|y) 


P2{y\z) 

Pi{v\z) 


almost surely. 


The equation contradicts the condition that the candidates of Vr\y and 'Py\z 
make the ratios unequal. So, that unequal ratios is equivalent to the impos¬ 
sibility of two sets of candidates satisfying the same observed quantities, i.e. 
identification of the joint distribution. □ 


The proof of Theorem 3 relies on the following lemma. 


Lemma 1. If any element ofVY\z satisfies: 

(14) J P{y\z)h{y)dy = 0 for any z h = 0, 

then the ratio of any two elements varies with z, i.e. for any Pi{y\z), P 2 {y\z) G 
Py\z, Pi{y\z) / P 2 {y\z) varies with z. 

Proof. We prove the lemma by contradiction. Suppose we have two can¬ 
didates Pi{y\z) and P 2 {y\z) that Pi{y\z)/P 2 {y\z) = h{y) for some function 
h. Then we have 

J P 2 {y\z)dy = J Pi{y\z)dy = J P 2 {y\z)h{y)dy = 1, 

and thus f P 2 (ylz){h(y} — Ijdy = 0. Since h — 1 0, the above equation 

contradicts condition (14). As a result, the ratio of any two elements of Vy\z 
must vary with z. 

□ 


Condition (14) is called completeness in Hu and Shiu (2011) and DHault- 
foeuille (2010). Before we prove Theorem 3, we introduce the following 
lemma about completeness, which is part of Lemma 4 of Hu and Shiu (2011). 
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Lemma 2. For every z, P{y\z) G Lp', and suppose that there exists a 
point zq with its open neighborhood U{zo) such that 

(i) the characteristic function (fzoit) of P{y\zQ) satisfies 0 < \(j)zQit)\ < 

C exp(— for i G M and some constants C,5 > 0; 

(a) dP{y\z)/dz for all z G U{zo) and dP{y\zo)/dy are in Lp; 

(Hi) there exists a sequence {zk G U{zq) : k = 1,2,...} converging to zq 
such that any finite subsequence P{y\zki) is linearly independent, i.e. 

I 

''^^aiP{y\zk^) = 0 implies ai = 0, for all i = 1,2,I. 

i=l 

Then P(y\z) is complete in Lp, i.e. condition (14) holds. 

We prove Theorem 3 under the following regularity conditions for the 
density function of the error term 

(1) the characteristic functions (j){t) of P{v) satisfies 0 < \4>{t)\ < Cexp(— (5|t|) 
for t G M and some constants C, 5 > 0; 

(2) p.{z), <7(z) and P(v) are continuously differentiable, and lv-dP(v)/dvl^dv 
is finite. 

Proof of Theorem 3 


Proof. From Lemma 1, in order to prove Theorem 3, we only need to 
prove that condition (14) holds under the regularity conditions above and 
condition (a) of Theorem 3. 

Without loss of generality, we choose zq that a(zo) = 1. Since y is con¬ 
tinuous with dy{z)/dz 7 ^ 0 for z = zq, there exists a distinct sequence 
{zk C U (zo) : A; = 1, 2,3,...} that y{zk) converges to y{zo). Condition (i) of 
Lemma 2 is satisfied under regularity condition (1). 

We then check that P{y\z) G for every z. We have 


\P{y\z)\'^dy 


f y - h{z) \ 

\ or{z) j 



Peie)\^de. 


Since |(/)(t)| < Cexp(—(5|t|), we have J^\(l){t)\‘^dt < +00, and thus f^\P£{e)\‘^ de < 
+00. Note that a{z) > 0, we have P{y\z) G for all z. From the regular¬ 
ity condition (2), one can straightforward verify that dP{y\z)/dz for all 
z G U{zo) and dP{y\zo)/dy are in L^. 

We verify condition (iii) of Lemma 2 under condition (a) of Theorem 3. 
Without loss of generality, we assume that for /c = 1, 2,..., / < -|-oo, and 
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{ofe G M : /c = 1,2,... 

aiP{y\zi) + a 2 P{y\z 2 ) H-h aiP{y\zi) = 0. 

Since the mapping M is linear and one-to-one, we have 


aiG{t\zi) P CL 2 G{t\z 2 ) -l- • • • -l- a];G(t\zj) — 0, 


with G{t\zi) = G{t, fi{zi),a{zi)}, and thus for t G {t : G{t\zi) / 0} 

, G{t\z 2 ) , , G{t\zi) ^ 

ai + 0,2 - V “I -+ “/ TTTTi \ — 0, 

G[t\zi) G{t\zi) 

Note condition (a) of Theorem 3, we assume without loss of generality that 

^ * = 2 , • • •, 

t^to (-T(^t\Zl) 


then we must have ai = 0. In the same vein, we can prove that a* = 0 for 
i = 2,..., which means {P{y\zi) : i = 1..., is linearly independent. 
Therefore condition (hi) of Lemma 2 holds under condition (a) Theorem 3. 

Finally, conditions (i), (ii) and (hi) hold under the regularity conditions 
(1), (2) and condition (a) of Theorem 3. Thus, condition (14) holds. □ 


Proof of Theorem 4 


Proof. The proof is by contradiction. We note that the quantities we can 
identify from the observed data are P{z,y,r = 1) and P{z\r = 0). Suppose 
we have two candidate models satisfying the same observed quantities: 

Pi{z,y,r = 1) = P 2 {z,y,r = 1), Pi( 2 ;|r = 0) = P 2 {z\r = 0) almost surely. 

By the pattern-mixture parametrization (4), we have 

Pi{z, y, r) = G 2 i exp{-rORi{y)}Pi{r\y = 0)Pi{z, y\r = 0), i = 1, 2, 

where G 2 i is the normalizing constant. From the above identities we have 


C' 2 iexp{-Oi?i(y)}Pi(r = l\y = 0)Pi{y\z,r = 0) 

= G 22 exp{-OR 2 {y)}P 2 ir = l\y = 0 )P 2 {y\z, r = 0), 

Pi(r = 1) = P 2 {r = 1). 

Note E[exp{—Oi2j(?/)}] = Pi{y = 0|r = 0)/Pi{y = 0|r = 1), we have 
= 11, = 0) = P,(r = = ‘I*' = 


Pi{r = 0|y = 0) E[exp{-Oi2i(y)}|r = 0] ’ 
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and thus almost surely 

Pi{y\z,r = 0) _ E[exp{-Ofii(y)}|r = 0] exp{O.Ri( 7 /)} 

P 2 {y\z,r = 0) E[ex.p{-OR 2 {y)}\r = 0] ex.p{OR 2 {y)}' 

The equation contradicts the condition that the candidates of Py\z,o 
OTZy make the ratios unequal. So, that unequal ratios is equivalent to the 
impossibility of two sets of candidates satisfying the same observed quanti¬ 
ties, i.e. identification of the joint distribution. □ 

The proof of Theorem 5 relies on the following lemma. 

Lemma 3. If any element ofPY\z,o satisfies: 

(15) J P{y\z,r = 0)h{y)dy = 0 for any z => h = 0, 

then the ratio of any two elements varies with z, i.e. for any Pi{y\z,r = 
0 ),P 2 iy\z,r = 0) G PY\Zfl, Pi{y\z,r = 0 )/P 2 {y\z,r = 0) varies with z. 

Proof. We prove the lemma by contradiction. Suppose we have two can¬ 
didates Pi{y\z, r = 0) and P 2 {y\z, r = 0) such that Pi{y\z, r = 0)/P2{y\z, r = 
0) = h{y) for some function h. Then we have 

j P 2 {y\z,r = 0)dy = j Pi{y\z,r = 0)dy = j P 2 {y\z,r = 0)h{y)dy = 1, 

and thus f P 2 {y\z,r = 0){h{y) — l}dy = 0. Since h — 1 0, the above 

equation contradicts condition (15). As a result, the ratio of any two elements 
of Py\z,o must vary with z. 

□ 


Proof of Theorem 5 

Proof. We prove Theorem 5 under the regularity conditions (1) and (2). 

(i) We hrst assume that P{y\z,r = 1) follows model (5). We consider h 
that 

(16) J P{y\z,r = 0)h{y)dy = 0 for any z. 

From equation (9), we have 

r = o) = = f)= f) 

’ E[exp{Oi?(y)}|r = 1] P(z|r = 0)’ 
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and thus from equation (16), f F(i/lz,r = 1) exp{OR(y)}h(y)dy = 
0 for any z. Since P{y\z, r = 1) follows model (5), by the same proof as 
that of Theorem 3 we have exp{OR{y)}h{y) = 0, and thus h{y) = 0. As 
a result, we have proved that f P{y\z, r = 0)h{y)dy = 0 for any z =► 
h = 0, and by Lemma 3, the ratio of any two elements of Vyizfl varies 
with 2 . 

(ii) We then assume that P{y\z,r = 0) follows model (5). By the same 
proof as that of Theorem 3 we have 

J P(y\z,r = 0)h{y)dy = 0 for any 2 =► h = 0, 

and thus by Lemma 3, the ratio of any two elements of Py\z,o varies 
with z. 


□ 


Proof of Theorem 6 

Proof. Since we assume the joint distribution of {Z, Y, R) conditional on 
X satisfies the identihcation condition in Theorem 2, the joint distribution 
is identifiable. Assuming the estimating equations (11), (12), (13) have a 
unique solution, we only need to prove that the equations are asymptotically 
unbiased. We use the subscript to denote the probability limit of the 
estimators. 

Proof of (a). By the law of iterated expectations, we have 

E [{u; {X, Y; a,j)R-l}h (X, Z)] 

= ¥.[¥.{[w{X,Y-a,-i)R-l]h{X,Z)\X,Y]]. 

Assumption 1 implies that 

E{[w {X,Y;a,-f)R-l]h{X, Z)\X, T} 

= E{[w{X,Y;a,-f)R-l] \X,Y}E{h {X, Z)\Y,X} 

= E{rc (X, Y; a, 7 ) E{R\X, Y) - l}E{h (X, Z)\Y, X}. 

Since P(r = l\y = 0,x;a) and OR{y, x]^) are correctly specified, under 
the true values of a and 7 , we have w {X,Y-, Uj'j) = l/E(i?|X, T), and 
thus E [{re (X, Y;a,'y)R—l}h (X, Z)] = 0, which is the probability limit of 
equation (11). So equation ( 11 ) is asymptotically unbiased. 
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Proof of (b). By the law of iterated expectations, we have at the true 
value of f5 and 7 

E{{1 - R)h{Z,X)\X} = P{r = 0\X)E{h{Z,X)\r = 0,X} 

= P{r = 0\X)E{h{Z,X)\r = 0,X-,l3,-f} 

= E[{l-R)E{h{Z,X)\r = 0,X;f3,^}\X], 

thus, 

E [(1 - R) {h {Z, X)-E[h {Z, X) \r = 0, X; 7 ]}] = 0. 

Therefore equation (12) is asymptotically unbiased and ^reg is consistent. 
As a result, E{y|r = 0,x;/3,7reg} is consistent with E(y|r = 0,x), and thus 
by the law of large numbers, 'flreg converges to 

E {(1 - R)E {Y\r = 0, x) + RE {Y\r = 1, x)} = E(y), 

which means 'jlreg is consistent. 

Proof of (c). Provided OR{y,x-,'y) is correctly specified, and f3 is consis¬ 
tent, we prove that equations (13) are asymptotically unbiased if either (a) 
or (b) holds. 

(i) If P(r = l|y = 0, x; a) is correctly specified, then a is consistent as we 
have proved that equations ( 11 ) are asymptotically unbiased in (a). 
By the assumption: Z_llP|(y, A), we have 

E[{u;(X, y; a, 7 )^ - l]{h{Z, X) - E[h{Z, X)|r = 0, A; ^, 7 ]}] 

= E[E{u;(A,y;a, 7 )P-l|A,y} 

xE{h{Z,X)-E[h{Z,X) \r = 0 ,A;/ 3 *, 7 ] |y,A}], 

which equals 0 since E {tc (A, Y; a,^) R — 1| A, y} = 0. 

(ii) If P{z, y\r = 1, x; /3) is correctly specified, and /3 is consistent, we have 
E{h{Z,X)\r = 0,X} =E{h{Z,X)\r = 0,X-,l3,-f}. Note that 

E[{u;(A, y; a*,j)R - l}{h{Z, A) - E[h{Z, X)\r = 0, A; /3, 7 ]}] 

= E[R{w (A, y; a*, 7 ) - 1} {h (Z, X)-E[h (Z, A) |r = 0, A; /3, 7 ]}] 
-E [(1 - R) {h (Z, X)-E[h (Z, A) |r = 0, A; /?, 7 ]}] . 

The second term of the right hand side equals 0 as we have proved 
in (b). We only need to prove that the first term also equals 0. From 
equation (3) we have 

P(t = Olzy = 0 ct*) 

R{w{X,Y-a*,^) - 1} = ^exp{OP(y|A;7)}^,|^:^yLl^. 
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and note equation (9) 


P{z,y\r = 0,x) 


eici>{OR{y\x)}P{z,y\r = l,x) 
E[exp{Oi?(y|x)}|r = l,x] ’ 


we have 


'&{h{Z, x)\r = 0, x] 


So we have 


h{z,x)P{z,y\r = 


0 , x)dzdy 


f exp{OR{y\x)}P{z,y\r = 1, x)h{z, x)dzdy 
E[exp{Oi?(y|x)}|r = l,x] 
'K[ex.p{OR{Y\x)}h{Z,x)\r = l,x] 
E[exp{Oi?(y|x)}|r = l,x] 

E[i? exp{Oi?(y|x)}/i(y, x)|x] 
E[i?exp{Oi?(y|x)}|x] ■ 


E[i2exp{Oi?(y|x)}/i(Z, x)|x] = E[i? exp{Oi?(y|x)}|x]E{/i(Z, x)|r = 0, x}, 
and thus 


E[i?exp{Oi?(y|x)}{/i(Z, x) — E[/i(Z, x)|r = 0, x]}|x] = 0, 


E 


P{r = 0|y = 0, X; a*) 


i?exp{Oi?(y|X;7)} 


P{r = l\y = 0, X; a*) 
{h(Z,X)-E[h(y,X)|r = 0,X;/3,7]} 


= 0 . 


By the law of iterated expectations, we have at the truth of j3 and 7 


P{r = 0|y = 0, X; a*) 
P(r = l\y = 0, X; a*) 


Pexp{OP(y|X;7)} 


•{h(Z,X)-E[h(Z,X)|r = 0,X;/3,7]} 


= 0 . 


So the first term must equal 0. Therefore, 7 ^^ is doubly robust. 

By similar arguments as (i) and (ii) with h{Z, X) replaced by Y, we 
can prove that under the conditions of (a) 


(17) E[{rn(X, y; a, 7)^ - l}{y - E[y|r = 0, X; /?*, 7]}] = 0, 
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and thus 

fi = E[w (X, Y- a,j)R{Y-E [y|r = 0, X; 13*, j]} + E {y|r = 0, X; ^, 7 }] 
and under the conditions of (b) 

(18) E[{u;(X, y; a*,j)R - l}{y - E[y|r = 0, X; /3, 7 ]}] = 0, 

and thus 

fi = E[w (X, Y-,a*,j)R{Y-E [y|r = 0, X; /3, 7 ]} + E {y|r = 0, X; /3, 7 }] ■ 
Therefore, Jifir is doubly robust for the mean of the outcome. 

□ 


SUPPLEMENTARY MATERIAL 

Supplement to “Identification and Doubly Robust Estimation 
of Data Missing Not at Random With a Shadow Variable” 

(doi: COMPLETED BY THE TYPESETTER; .pdf). This supplement con¬ 
tains variance estimation and additional details on inference, details of Ex¬ 
ample 1, 2, simulation results for a binary outcome, and details on model 
fits for the real data example. 
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